 failure diagnosis


Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols

arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. To address these limitations, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended evaluation and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we build the ViFailback-8B VLM, which not only achieves a significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/


ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System

arXiv.org Artificial Intelligence

Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and insufficient accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches. A failure graph is constructed based on the output of the state classifier, and ClusterRCA then performs a customized random walk on this graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show that ClusterRCA achieves high accuracy in diagnosing network failures for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
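
The abstract does not spell out the customized random walk, so the following is only a generic sketch of the underlying idea: a random walk with restart over a small failure graph, where nodes that look more anomalous are visited more often and therefore rank higher as root-cause candidates. The node names, the anomaly scores, the restart probability, and the function `random_walk_rank` are all hypothetical and are not taken from ClusterRCA.

```python
import random
from collections import Counter


def random_walk_rank(graph, anomaly_score, start, steps=10000, restart=0.15, seed=0):
    """Rank nodes by how often a biased random walk with restart visits them.

    graph: dict mapping node -> list of neighbor nodes (hypothetical failure graph).
    anomaly_score: dict mapping node -> positive score; in this sketch it stands in
        for a classifier output and biases the walk toward suspicious nodes.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        neighbors = graph.get(node, [])
        # Restart at the start node on a dead end or with probability `restart`.
        if not neighbors or rng.random() < restart:
            node = start
            continue
        # Move to a neighbor with probability proportional to its anomaly score.
        weights = [max(anomaly_score.get(n, 0.0), 1e-9) for n in neighbors]
        node = rng.choices(neighbors, weights=weights, k=1)[0]
    # Most-visited nodes first.
    return [n for n, _ in visits.most_common()]


# Hypothetical three-node failure graph; nic_c is given the highest anomaly score.
graph = {
    "nic_a": ["nic_b", "nic_c"],
    "nic_b": ["nic_a", "nic_c"],
    "nic_c": ["nic_a", "nic_b"],
}
scores = {"nic_a": 0.1, "nic_b": 0.1, "nic_c": 0.9}
ranked = random_walk_rank(graph, scores, start="nic_a")
print(ranked)
```

In this toy setup the walk concentrates on the node with the highest score, which is the behavior a graph-based localization step relies on; a real system would derive both the edges and the scores from measured data.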


TelOps: AI-driven Operations and Maintenance for Telecommunication Networks

arXiv.org Artificial Intelligence

Telecommunication Networks (TNs) have become the most important infrastructure for data communications over the last century. Operations and maintenance (O&M) is extremely important to ensure the availability, effectiveness, and efficiency of TN communications. Unlike artificial intelligence for IT Operations (AIOps), the popular O&M technique for IT systems (e.g., the cloud), O&M for TNs faces three fundamental challenges: topological dependence of network components, highly heterogeneous software, and restricted failure data. This article presents TelOps, the first AI-driven O&M framework for TNs, systematically enhanced with mechanism, data, and empirical knowledge. We provide a comprehensive comparison between TelOps and AIOps, and conduct a proof-of-concept case study on a typical O&M task (failure diagnosis) for a real industrial TN. As the first systematic AI-driven O&M framework for TNs, TelOps opens a new door to applying AI techniques to TN automation.


Decentralized Failure Diagnosis of Stochastic Discrete Event Systems

arXiv.org Artificial Intelligence

Recently, the diagnosability of stochastic discrete event systems (SDESs) was investigated in the literature, and the failure diagnosis considered was centralized. In this paper, we propose an approach to decentralized failure diagnosis of SDESs, where the stochastic system uses multiple local diagnosers to detect failures and each local diagnoser possesses its own information. In a way, the centralized failure diagnosis of SDESs can be viewed as a special case of the decentralized failure diagnosis presented in this paper with only one projection. The main contributions are as follows: (1) We formalize the notion of codiagnosability for stochastic automata, which means that a failure can be detected by at least one local stochastic diagnoser within a finite delay. (2) We construct a codiagnoser from a given stochastic automaton with multiple projections, and the codiagnoser associated with the local diagnosers is used to test the codiagnosability condition of SDESs. (3) We establish a number of basic properties of the codiagnoser. In particular, a necessary and sufficient condition for the codiagnosability of SDESs is presented. (4) We give a detailed method for checking whether codiagnosability is violated. (5) We describe some examples to illustrate the applications of codiagnosability and its computing method.
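
The idea that a failure need only be detected by at least one local diagnoser can be illustrated with a heavily simplified sketch. It is non-stochastic, works on a small finite set of traces rather than an automaton, and ignores the finite-delay requirement, so it is not the codiagnoser construction of the paper. The event names, the site names, and the helpers `project` and `detecting_sites` are hypothetical.

```python
def project(trace, observable):
    """Natural projection: keep only the events a local site can observe."""
    return tuple(e for e in trace if e in observable)


def detecting_sites(trace, traces, sites, failure="f"):
    """Return the sites whose observation of `trace` matches no failure-free trace.

    A site detects the failure when every trace without the failure event
    projects to something different from what the site observed.
    """
    normal_traces = [t for t in traces if failure not in t]
    detected = []
    for name, observable in sites.items():
        seen = project(trace, observable)
        if all(project(t, observable) != seen for t in normal_traces):
            detected.append(name)
    return detected


# Hypothetical system behavior: two failure-free traces and two traces
# containing the unobservable failure event "f".
traces = [("a",), ("a", "b"), ("a", "f", "c"), ("f", "b", "b")]

# Each local site observes its own subset of events; neither observes "f".
sites = {"site1": {"a", "b"}, "site2": {"c"}}

print(detecting_sites(("a", "f", "c"), traces, sites))  # ['site2']
print(detecting_sites(("f", "b", "b"), traces, sites))  # ['site1']
```

Neither site detects both failure traces on its own, yet every failure trace is detected by some site, which is the intuition behind decentralized diagnosis; with a single site and a single projection the check reduces to the centralized case.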